For this project, we will follow the DCOVAC process. The process is listed below:
DCOVAC – THE DATA MODELING FRAMEWORK
This dataset has 1345 rows and 16 variables. For this analysis, we will ignore all of the categorical named variables that come with the dataset.
VARIABLES TO PREDICT WITH:
URBANICITY: where the accident took place (1=Rural Area, 2=Urban Area)
REGION: region where the accident took place (1=Northeast, 2=Midwest, 3=South, 4=West)
VE_TOTAL: total number of vehicles involved in the accident
PEDS: total number of pedestrians involved in the accident
MONTH: month in which the accident took place
DAY_WEEK: day of the week the accident took place (1=Sunday)
HOUR: hour of the day the accident took place (military time)
MAX_SEV: the max severity of the injuries (0 = No Apparent Injury, 1 = Possible Injury, 2 = Suspected Minor Injury, 3 = Suspected Serious Injury)
MAN_COLL: manner of collision (coded category)
WRK_ZONE: was it a work zone or not (0=NO, 1=YES)
VARIABLES WE WANT TO PREDICT
ALCOHOL: was alcohol present in the accident (1=YES, 2=NO)
NUM_INJ: total number of injuries resulting from the accident
| Variable | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|---|
| REGION | 1.000 | 3.000 | 3.000 | 2.949 | 3.000 | 4.000 |
| URBANICITY | 1.000 | 1.000 | 1.000 | 1.212 | 1.000 | 2.000 |
| VE_TOTAL | 1.000 | 1.000 | 2.000 | 1.845 | 2.000 | 5.000 |
| PEDS | 0.00000 | 0.00000 | 0.00000 | 0.07138 | 0.00000 | 2.00000 |
| NUM_INJ | 0.0000 | 0.0000 | 1.0000 | 0.7903 | 1.0000 | 8.0000 |
| MONTH | 1.000 | 4.000 | 7.000 | 6.768 | 10.000 | 12.000 |
| DAY_WEEK | 1.000 | 3.000 | 4.000 | 4.109 | 6.000 | 7.000 |
| HOUR | 0.00 | 10.00 | 14.00 | 13.74 | 17.00 | 99.00 |
| ALCOHOL | 1.000 | 2.000 | 2.000 | 1.928 | 2.000 | 2.000 |
| MAX_SEV | 0.0000 | 0.0000 | 1.0000 | 0.9398 | 2.0000 | 5.0000 |
| MAN_COLL | 0.000 | 0.000 | 1.000 | 2.938 | 6.000 | 98.000 |
| WRK_ZONE | 0.00000 | 0.00000 | 0.00000 | 0.02825 | 0.00000 | 4.00000 |

REL_ROADNAME, LGT_CONDNAME, and WEATHER1NAME are character columns of length 1345.
Transform Variables
# A tibble: 2 × 2
ALCOHOL n
<chr> <int>
1 1 97
2 2 1248
About 93% of the accidents (1,248 of 1,345) involved no alcohol. Among the potential predictors, ALCOHOL has its strongest relationships with REGION, MAX_SEV, VE_TOTAL, and NUM_INJ.
The largest concentration of values is at 0-1 injuries, and the distribution is skewed to the right: most car accidents result in no injuries or only a few, while a small number result in many (up to 8 here). Among the potential predictors, NUM_INJ has its strongest relationships with PEDS, MAX_SEV, and VE_TOTAL.
For this analysis we will use a Linear Regression Model.
| Term | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| MAX_SEV | 0.640 | 0.019 | 34.458 | 0.000 |
| VE_TOTAL | 0.320 | 0.043 | 7.472 | 0.000 |
| WEATHER1NAMEReported as Unknown | -1.571 | 0.510 | -3.084 | 0.002 |
| WEATHER1NAMEFog, Smog, Smoke | -0.449 | 0.242 | -1.858 | 0.063 |
| WRK_ZONE2 | -1.259 | 0.707 | -1.781 | 0.075 |
| HOUR | 0.005 | 0.003 | 1.775 | 0.076 |
| URBANICITYDusk | -0.235 | 0.137 | -1.709 | 0.088 |
| WEATHER1NAMEOther | 0.805 | 0.500 | 1.611 | 0.108 |
| REGION | 0.040 | 0.026 | 1.525 | 0.127 |
| WRK_ZONE1 | 0.355 | 0.238 | 1.490 | 0.137 |
| URBANICITYDaylight | -0.072 | 0.054 | -1.341 | 0.180 |
| WEATHER1NAMERain | 0.078 | 0.064 | 1.229 | 0.219 |
| REL_ROADNAMEIn Parking Lane/Zone | -0.780 | 0.733 | -1.064 | 0.288 |
| REL_ROADNAMEOutside Trafficway | -0.696 | 0.751 | -0.927 | 0.354 |
| URBANICITYDawn | -0.151 | 0.165 | -0.917 | 0.359 |
| MAN_COLL | -0.003 | 0.003 | -0.828 | 0.408 |
| PEDS | 0.070 | 0.085 | 0.823 | 0.411 |
| REL_ROADNAMEOn Roadside | -0.558 | 0.725 | -0.770 | 0.442 |
| REL_ROADNAMEOn Roadway | -0.547 | 0.719 | -0.760 | 0.448 |
| REL_ROADNAMEOn Shoulder | -0.558 | 0.755 | -0.739 | 0.460 |
| REL_ROADNAMEOff Roadway-Location Unknown | -0.608 | 0.879 | -0.692 | 0.489 |
| REL_ROADNAMEGore | -0.674 | 1.015 | -0.664 | 0.507 |
| MONTH | 0.003 | 0.006 | 0.520 | 0.603 |
| URBANICITYDark - Unknown Lighting | 0.142 | 0.276 | 0.513 | 0.608 |
| URBANICITYOther | -0.337 | 0.708 | -0.476 | 0.634 |
| WRK_ZONE4 | -0.104 | 0.289 | -0.360 | 0.719 |
| WEATHER1NAMENot Reported | -0.025 | 0.084 | -0.301 | 0.764 |
| URBANICITYNot Reported | -0.063 | 0.235 | -0.269 | 0.788 |
| WEATHER1NAMECloudy | -0.013 | 0.061 | -0.208 | 0.836 |
| WRK_ZONE3 | 0.128 | 0.706 | 0.181 | 0.856 |
| DAY_WEEK | 0.002 | 0.010 | 0.179 | 0.858 |
| URBANICITYDark - Not Lighted | 0.011 | 0.080 | 0.135 | 0.892 |
| REL_ROADNAMEOn Median | -0.070 | 0.735 | -0.095 | 0.925 |
| WEATHER1NAMESleet or Hail | 0.049 | 0.719 | 0.068 | 0.946 |
| WEATHER1NAMESnow | 0.007 | 0.170 | 0.043 | 0.966 |
| (Intercept) | -0.013 | 0.737 | -0.018 | 0.986 |
After examining this model, we determine that there are some predictors that are not important in predicting the number of injuries, so a pruned version of the model is created by removing predictors that are not significant.
For this analysis we will use a pruned Linear Regression Model. We removed URBANICITY, PEDS, MONTH, DAY_WEEK, HOUR, WRK_ZONE, REL_ROADNAME, and LGT_CONDNAME.
| Term | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| MAX_SEV | 0.642 | 0.018 | 36.493 | 0.000 |
| VE_TOTAL | 0.289 | 0.033 | 8.867 | 0.000 |
| (Intercept) | -0.489 | 0.099 | -4.965 | 0.000 |
| WEATHER1NAMEReported as Unknown | -1.418 | 0.506 | -2.803 | 0.005 |
| REGION | 0.050 | 0.026 | 1.927 | 0.054 |
| WEATHER1NAMEFog, Smog, Smoke | -0.408 | 0.237 | -1.723 | 0.085 |
| WEATHER1NAMERain | 0.098 | 0.063 | 1.563 | 0.118 |
| WEATHER1NAMEOther | 0.776 | 0.500 | 1.551 | 0.121 |
| MAN_COLL | -0.003 | 0.003 | -0.933 | 0.351 |
| WEATHER1NAMENot Reported | -0.042 | 0.081 | -0.515 | 0.607 |
| WEATHER1NAMESleet or Hail | -0.189 | 0.707 | -0.267 | 0.789 |
| WEATHER1NAMESnow | 0.030 | 0.170 | 0.179 | 0.858 |
| WEATHER1NAMECloudy | -0.002 | 0.061 | -0.035 | 0.972 |
After examining this model, the residual plots show some issues with our data. The residuals vs. fitted plot has clear patterns in it, which means the model is unable to capture all of the systematic variation within the data. There also seem to be outliers at the top of the residual plot. The points on the Q-Q plot follow a curve and fall on the line only near the middle of the plot, with the beginning and end curving away from the line. This means that the residuals deviate from the expected normal distribution, with tails that are lighter or heavier than a normal distribution would produce.
Removing the predictors that did not help predict the number of injuries resulting from a car crash did not have a big impact on our fit statistics (R-squared and RMSE (root mean squared error)).
Overall
The model comparison shows that the best model for predicting whether alcohol was involved in the crash is the neural network regression. It has a higher R-squared and a lower RMSE than the other models. Although its R-squared is not very high and its RMSE is not very low, it is the best of the 4 models we are comparing.
In conclusion, our predictors do not do well at predicting either whether alcohol was involved in the car accident or the number of injuries resulting from it.
Can the day of the week predict whether or not there was alcohol present in the crash? The day of the week has an influence on whether or not alcohol was involved. Saturday and Sunday have the largest number of car crashes with alcohol involved. This makes sense because drinking occurs primarily on weekends rather than weekdays.
Does the number of vehicles predict the number of injuries? A higher number of injuries is predicted when more vehicles are involved in the crash.
What region has the most crashes? The South (MD, DE, DC, WV, VA, KY, TN, NC, SC, GA, FL, AL, MS, LA, AR, OK, TX) has the most crashes, at about 60% of the total, with about 6.3% being alcohol-related.
Which variables best predict the number of injuries? The best predictors of the total number of injuries were the number of vehicles, the max severity of the injuries, and the region in which the car accident took place.
---
title: "Nationwide Car Crash 2019 Report"
output:
  flexdashboard::flex_dashboard:
    vertical_layout: scroll
    source_code: embed
---
```{r setup, include=FALSE, warning=FALSE}
#include=FALSE will not include r code in output
#warning=FALSE will remove any warnings from output
library(flexdashboard)
library(tidyverse)
library(GGally)
library(caret) #for logistic regression
library(broom) #for tidy() function
```
```{r load_data}
CCREPORT <- read_csv("AccidentReport2019.csv")
```
Introduction {data-orientation=rows}
=======================================================================
Row {data-height=250}
-----------------------------------------------------------------------
### Overview
For this project, we will follow the DCOVAC process. The process is listed below:
DCOVAC – THE DATA MODELING FRAMEWORK
* DEFINE the Problem
* COLLECT the Data from Appropriate Sources
* ORGANIZE the Data Collected
* VISUALIZE the Data by Developing Charts
* ANALYZE the data with Appropriate Statistical Methods
* COMMUNICATE your Results
Row {data-height=650}
-----------------------------------------------------------------------
### The Problem & Data Collection
#### The Problem
* ***Problem Description***
The Car Crash 2019 Data used in this dashboard shows all of the car crash data from 2019 in the United States. We will examine the variables in the dataset to determine what helps to predict the number of injuries in a car crash as well as if there was alcohol involved or not.
### The Questions
* ***Analysis Questions***
1. Can the day of the week predict whether or not there was alcohol present in the crash?
2. Does the number of vehicles predict the number of injuries?
3. What region has the most crashes?
4. Which Variables best predict the number of injuries?
#### The Data
This dataset has 1345 rows and 16 variables. For this analysis, we will ignore all of the categorical named variables that come with the dataset.
#### Data Sources
https://www.nhtsa.gov/file-downloads?p=nhtsa/downloads/FARS/
### Description of the Variables in the Dataset
* ***VARIABLES TO PREDICT WITH:***
* *URBANICITY*: where the accident took place (1=Rural Area, 2=Urban Area)
* *REGION*: region where the accident took place (1=Northeast, 2=Midwest, 3=South, 4=West)
* *VE_TOTAL*: total number of vehicles involved in the accident
* *PEDS*: total number of pedestrians involved in the accident
* *MONTH*: month in which the accident took place
* *DAY_WEEK*: day of the week the accident took place (1=Sunday)
* *HOUR*: hour of the day the accident took place (military time)
* *MAX_SEV*: the max severity of the injuries (0 = No Apparent Injury, 1 = Possible Injury, 2 = Suspected Minor Injury, 3 = Suspected Serious Injury)
* *MAN_COLL*: manner of collision (coded category)
* *WRK_ZONE*: was it a work zone or not (0=NO, 1=YES)
* ***VARIABLES WE WANT TO PREDICT***
* *ALCOHOL*: was alcohol present in the accident (1=YES, 2=NO)
* *NUM_INJ*: total number of injuries resulting from the accident
Summary Statistics
=======================================================================
Column {data-width=650}
-------------------------------------------------------------------
### Summary Statistics
```{r, cache=TRUE}
#the cache=TRUE can be removed. This will allow you to rerun your code without it having to run EVERYTHING from scratch every time. If the output seems to not reflect new updates, you can choose Knit, Clear Knitr cache to fix.
#View data
#drop the *NAME label columns that duplicate coded columns, plus HARM_EVNAME, YEAR, and HARM_EV
CCREPORT <- select(CCREPORT,-MONTHNAME,-REGIONNAME,-URBANICITYNAME,-DAY_WEEKNAME,-MAX_SEVNAME,-ALCOHOLNAME,-HARM_EVNAME,-YEAR,-HARM_EV)
summary(CCREPORT)
```
Column {data-width=300}
-----------------------------------------------------------------------
### Transform Variables
```{r, cache=TRUE}
# Convert each coded or text column to a factor, assigning it back to itself.
# HOUR is left numeric: as a factor it can have more levels than ggpairs()
# accepts by default (15).
CCREPORT <- mutate(CCREPORT,
                   WRK_ZONE=as.factor(WRK_ZONE),
                   URBANICITY=as.factor(URBANICITY),
                   MONTH=as.factor(MONTH),
                   DAY_WEEK=as.factor(DAY_WEEK),
                   ALCOHOL=as.factor(ALCOHOL),
                   MAX_SEV=as.factor(MAX_SEV),
                   WEATHER1NAME=as.factor(WEATHER1NAME),
                   REL_ROADNAME=as.factor(REL_ROADNAME),
                   LGT_CONDNAME=as.factor(LGT_CONDNAME))
```
#### ALCOHOL (Alcohol Involved?)
```{r, cache=TRUE}
tibble::as_tibble(select(CCREPORT,ALCOHOL) %>%
table())
```
#### NUM_INJ (Number of Injuries)
<!--Instructions to import .jpg or .png images
use getwd() to see current path structure
copy file into same place as .Rmd file
put the path to this file in the link
format:  -->

Data Viz #1
=======================================================================
Column {data-width=500}
-----------------------------------------------------------------------
### Response Variables
#### ALCOHOL YES(1)/NO(2)
```{r, cache=TRUE}
as_tibble(select(CCREPORT,ALCOHOL) %>%
table()) %>%
ggplot(aes(y=n,x=ALCOHOL)) + geom_bar(stat="identity")
```
About 93% of the accidents (1,248 of 1,345) involved no alcohol. Among the potential predictors, ALCOHOL has its strongest relationships with REGION, MAX_SEV, VE_TOTAL, and NUM_INJ.
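The class shares can be checked directly from the frequency counts reported for ALCOHOL (97 yes, 1248 no). The short sketch below uses only those two counts; the variable names are illustrative and not part of the dashboard code.

```r
# Counts taken from the ALCOHOL frequency table (1 = yes, 2 = no)
alcohol_counts <- c(yes = 97, no = 1248)

# Share of each class as a percentage, rounded to one decimal place
alcohol_share <- round(100 * prop.table(alcohol_counts), 1)
print(alcohol_share)

# Accuracy of a baseline that always predicts "no alcohol"
baseline_accuracy <- max(alcohol_counts) / sum(alcohol_counts)
print(round(baseline_accuracy, 3))
```

With classes this unbalanced, a model that always answers "no" is already right about 93% of the time, which is a useful yardstick when reading the ALCOHOL model results.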
Column {data-width=500}
-----------------------------------------------------------------------
### Transform Variables
```{r, cache=TRUE}
ggpairs(select(CCREPORT,ALCOHOL,NUM_INJ,REGION,URBANICITY,VE_TOTAL,PEDS,MONTH))
```
Data Viz #2
=======================================================================
Column {data-width=500}
-----------------------------------------------------------------------
### Response Variables
#### NUM_INJ
```{r, cache=TRUE}
ggplot(CCREPORT, aes(x = NUM_INJ)) +
geom_bar()
```
The largest concentration of values is at 0-1 injuries, and the distribution is skewed to the right: most car accidents result in no injuries or only a few, while a small number result in many (up to 8 here). Among the potential predictors, NUM_INJ has its strongest relationships with PEDS, MAX_SEV, and VE_TOTAL.
Column {data-width=500}
-----------------------------------------------------------------------
### Transform Variables
```{r, cache=TRUE}
ggpairs(select(CCREPORT,ALCOHOL,NUM_INJ,DAY_WEEK,HOUR,MAX_SEV,MAN_COLL))
```
NUM_INJ Analysis {data-orientation=rows}
=======================================================================
Row
-----------------------------------------------------------------------
### Predict Number of Injuries
For this analysis we will use a Linear Regression Model.
```{r, include=FALSE, cache=TRUE}
#the include=FALSE hides the output - remove to see
NUM_INJ_lm <- lm(NUM_INJ ~ . -ALCOHOL,data = CCREPORT)
summary(NUM_INJ_lm)
```
```{r, include=FALSE, cache=TRUE}
#the include=FALSE hides the output - remove to see
tidy(NUM_INJ_lm)
```
### Adjusted R-Squared
```{r, cache=TRUE}
ARSq<-round(summary(NUM_INJ_lm)$adj.r.squared,2)
valueBox(paste(ARSq*100,'%'), icon = "fa-thumbs-up")
```
### RMSE
```{r, cache=TRUE}
Sig<-round(summary(NUM_INJ_lm)$sigma,2)
valueBox(Sig, icon = "fa-thumbs-up")
```
Row
-----------------------------------------------------------------------
### Regression Output
```{r,include=FALSE, cache=TRUE}
#knitr::kable(summary(NUM_INJ_lm)$coef, digits = 3) #pretty table output
summary(NUM_INJ_lm)$coef
```
```{r, cache=TRUE}
# this version sorts the p-values (it is using an index to reorder the coefficients)
idx <- order(coef(summary(NUM_INJ_lm))[,4])
out <- coef(summary(NUM_INJ_lm))[idx,]
knitr::kable(out, digits = 3) #pretty table output
```
### Residual Assumptions Explorations
```{r, cache=TRUE}
plot(NUM_INJ_lm, which=c(1,2)) #which tells which plots to show (1-6 different plots)
```
Row
-----------------------------------------------------------------------
### Analysis Summary
After examining this model, we determine that there are some predictors that are not important in predicting the number of injuries, so a pruned version of the model is created by removing predictors that are not significant.
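One way to shortlist predictors for pruning is to read the p-values straight out of the coefficient table. The sketch below is a minimal illustration on synthetic data (the `toy` data frame, its column names, and the 0.05 cutoff are assumptions, not part of the crash dataset or the dashboard).

```r
set.seed(42)  # reproducible synthetic data, not the crash dataset

n <- 200
toy <- data.frame(
  signal = rnorm(n),  # truly related to the response
  noise  = rnorm(n)   # unrelated to the response
)
toy$y <- 2 * toy$signal + rnorm(n)

full_fit <- lm(y ~ signal + noise, data = toy)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
coefs <- coef(summary(full_fit))

# Terms (excluding the intercept) whose p-value is below the chosen cutoff
cutoff <- 0.05
p_values <- coefs[rownames(coefs) != "(Intercept)", "Pr(>|t|)"]
keep <- names(p_values)[p_values < cutoff]
print(keep)
```

For factor predictors such as WEATHER1NAME, every level gets its own row in this table, so a term-level test such as `drop1(full_fit, test = "F")` may be a better guide than individual rows.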
Row
-----------------------------------------------------------------------
### Predict total number of injuries Final Version
For this analysis we will use a pruned Linear Regression Model. We removed URBANICITY, PEDS, MONTH, DAY_WEEK, HOUR, WRK_ZONE, REL_ROADNAME, and LGT_CONDNAME.
```{r, include=FALSE, cache=TRUE}
#the include=FALSE hides the output - remove to see
NUM_INJ_lm <- lm(NUM_INJ ~ . -ALCOHOL -URBANICITY -PEDS -MONTH -DAY_WEEK -HOUR -WRK_ZONE - REL_ROADNAME -LGT_CONDNAME,data = CCREPORT)
summary(NUM_INJ_lm)
```
```{r, include=FALSE, cache=TRUE}
#the include=FALSE hides the output - remove to see
tidy(NUM_INJ_lm)
```
### Adjusted R-Squared
```{r, cache=TRUE}
ARSq<-round(summary(NUM_INJ_lm)$adj.r.squared,2)
valueBox(paste(ARSq*100,'%'), icon = "fa-thumbs-up")
```
### RMSE
```{r, cache=TRUE}
Sig<-round(summary(NUM_INJ_lm)$sigma,2)
valueBox(Sig, icon = "fa-thumbs-up")
```
Row
-----------------------------------------------------------------------
### Regression Output
```{r, include=FALSE, cache=TRUE}
knitr::kable(summary(NUM_INJ_lm)$coef, digits = 3) #pretty table output
```
```{r, cache=TRUE}
# this version sorts the p-values (it is using an index to reorder the coefficients)
idx <- order(coef(summary(NUM_INJ_lm))[,4])
out <- coef(summary(NUM_INJ_lm))[idx,]
knitr::kable(out, digits = 3) #pretty table output
```
### Residual Assumptions Explorations
```{r, cache=TRUE}
plot(NUM_INJ_lm, which=c(1,2)) #which tells which plots to show (1-6 different plots)
```
Row
-----------------------------------------------------------------------
### Analysis Summary
After examining this model, the residual plots show some issues with our data. The residuals vs. fitted plot has clear patterns in it, which means the model is unable to capture all of the systematic variation within the data. There also seem to be outliers at the top of the residual plot. The points on the Q-Q plot follow a curve and fall on the line only near the middle of the plot, with the beginning and end curving away from the line. This means that the residuals deviate from the expected normal distribution, with tails that are lighter or heavier than a normal distribution would produce.
Removing the predictors that did not help predict the number of injuries resulting from a car crash did not have a big impact on our fit statistics (R-squared and RMSE (root mean squared error)).
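A side-by-side comparison of a full and a pruned model can be sketched as follows. This uses synthetic stand-in data (the `toy` data frame and the predictors `x1`, `x2`, `x3` are hypothetical), and it reports the residual standard error from `summary()`, which is the quantity the dashboard displays as RMSE.

```r
set.seed(1)  # synthetic stand-in data; names are illustrative only

n <- 300
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- 1 + 0.6 * toy$x1 + 0.3 * toy$x2 + rnorm(n)  # x3 plays no role

full_fit   <- lm(y ~ x1 + x2 + x3, data = toy)
pruned_fit <- lm(y ~ x1 + x2, data = toy)

# Pull the two fit statistics out of a fitted lm object
fit_stats <- function(fit) {
  s <- summary(fit)
  c(adj_r_squared = s$adj.r.squared, residual_se = s$sigma)
}

comparison <- rbind(full = fit_stats(full_fit), pruned = fit_stats(pruned_fit))
print(round(comparison, 3))

# Partial F-test: does the full model fit significantly better?
print(anova(pruned_fit, full_fit))
```

If the two rows of `comparison` are nearly identical and the F-test is not significant, the dropped predictors were contributing little, which is the pattern described above.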
ALCOHOL Analysis {data-orientation=rows}
=======================================================================
Row {data-height=900}
-----------------------------------------------------------------------
### Predict Alcohol Involvement

### Regression Tree

### Neural Nets

### Bootstrap Forest

Row
-------------------------------------
### Regression/ Estimation Model Comparison

* ***Overall***
The model comparison shows that the best model for predicting whether alcohol was involved in the crash is the neural network regression. It has a higher R-squared and a lower RMSE than the other models. Although its R-squared is not very high and its RMSE is not very low, it is the best of the 4 models we are comparing.
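For reference, the two comparison statistics can be computed from any model's predictions as sketched below. The vectors `actual`, `pred_a`, and `pred_b` are made-up values for illustration only; they are not output from the models compared here.

```r
# Root mean squared error: typical size of a prediction error
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# R-squared: share of the variance in `actual` explained by the predictions
r_squared <- function(actual, predicted) {
  1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
}

# Hypothetical outcomes and predictions from two competing models
actual <- c(0, 1, 1, 0, 1)
pred_a <- c(0.1, 0.8, 0.7, 0.2, 0.9)  # tracks the outcomes closely
pred_b <- c(0.5, 0.5, 0.5, 0.5, 0.5)  # same guess for every case

scores <- rbind(
  model_a = c(rmse = rmse(actual, pred_a), r_squared = r_squared(actual, pred_a)),
  model_b = c(rmse = rmse(actual, pred_b), r_squared = r_squared(actual, pred_b))
)
print(round(scores, 3))
```

The preferred model is the one with the lower RMSE and the higher R-squared, which is the rule applied in the comparison above.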
Conclusion
=======================================================================
### Summary
In conclusion, our predictors do not do well at predicting either whether alcohol was involved in the car accident or the number of injuries resulting from it.
**Can the day of the week predict whether or not there was alcohol present in the crash?**
The day of the week has an influence on whether or not alcohol was involved. Saturday and Sunday have the largest number of car crashes with alcohol involved. This makes sense because drinking occurs primarily on weekends rather than weekdays.
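A cross-tabulation is one simple way to check a claim like this. The sketch below uses a handful of hypothetical records that follow the dataset's coding (DAY_WEEK 1 = Sunday through 7 = Saturday, ALCOHOL 1 = yes, 2 = no); the `crashes` data frame is invented for illustration and is not the real data.

```r
# Hypothetical records using the dataset's coding:
# DAY_WEEK 1 = Sunday ... 7 = Saturday; ALCOHOL 1 = yes, 2 = no
crashes <- data.frame(
  DAY_WEEK = c(1, 1, 1, 2, 3, 4, 5, 6, 7, 7, 7, 7),
  ALCOHOL  = c(1, 2, 1, 2, 2, 2, 2, 2, 1, 1, 2, 2)
)

# Crash counts by day of week and alcohol involvement
counts <- table(day = crashes$DAY_WEEK, alcohol = crashes$ALCOHOL)
print(counts)

# Row-wise proportions: share of each day's crashes that involved alcohol
alcohol_rate <- prop.table(counts, margin = 1)[, "1"]
print(round(alcohol_rate, 2))
```

Comparing rates rather than raw counts matters here, because a day with more crashes overall will tend to have more alcohol-related crashes even if its rate is no higher.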
**Does the number of vehicles predict the number of injuries?**
A higher number of injuries is predicted when more vehicles are involved in the crash.
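The size of this effect can be read off the pruned regression output. The sketch below copies the reported coefficients for the numeric terms and assumes the weather condition is at its reference level (the level omitted from the coefficient table); the function `predict_injuries` and the example inputs are illustrative.

```r
# Coefficients copied from the pruned regression output (numeric terms only)
b <- c(intercept = -0.489, MAX_SEV = 0.642, VE_TOTAL = 0.289,
       REGION = 0.050, MAN_COLL = -0.003)

# Predicted number of injuries, assuming the reference weather level
predict_injuries <- function(max_sev, ve_total, region, man_coll) {
  unname(b["intercept"] +
           b["MAX_SEV"]  * max_sev +
           b["VE_TOTAL"] * ve_total +
           b["REGION"]   * region +
           b["MAN_COLL"] * man_coll)
}

# Same crash profile with two versus three vehicles involved
two_vehicles   <- predict_injuries(max_sev = 1, ve_total = 2, region = 3, man_coll = 1)
three_vehicles <- predict_injuries(max_sev = 1, ve_total = 3, region = 3, man_coll = 1)

print(two_vehicles)
print(three_vehicles - two_vehicles)  # change per additional vehicle
```

Holding the other inputs fixed, each additional vehicle raises the predicted number of injuries by the VE_TOTAL coefficient, about 0.29.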
**What region has the most crashes?**
The South (MD, DE, DC, WV, VA, KY, TN, NC, SC, GA, FL, AL, MS, LA, AR, OK, TX) has the most crashes, at about 60% of the total, with about 6.3% being alcohol-related.
**Which Variables best predict the number of injuries?**
The best predictors of the total number of injuries were the number of vehicles, the max severity of the injuries, and the region in which the car accident took place.